Project: Investigate a Dataset - [Movie Dataset]¶

Table of Contents¶

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

Introduction¶

In this report we analyze a dataset obtained from The Movie Database (TMDb) API. The dataset contains information about movies, including details such as budget, overview, and revenue. The main purpose of this report is to look for trends in the film industry and to identify the reasons behind a movie's success.

Dataset Description¶

1: Dataset Source
* The data was obtained from the TMDb API.

2: Dataset Size
* The data contains 10866 rows and 21 columns.

3: Dataset Concerns
* Movies are not evenly distributed across release years, so some years are sparsely covered.
* Some columns have missing values (for example homepage, tagline, and keywords).

Questions for Analysis¶

Is there a relationship between movie budgets and revenues across different genres?

Understanding the relationship between budgets and revenues can provide insights into the financial aspects of movie production. By analyzing this relationship across genres, we can determine if certain genres are more financially successful than others, and if budgeting strategies should vary based on genre.

How has the popularity of different genres changed over the years?

Examining popularity trends can reveal changes in audience preferences over time. This analysis can help filmmakers and studios understand which genres are currently trending and adjust their production strategies accordingly to cater to audience interests.

How has the average runtime of movies changed over the years?

Changes in movie runtime can reflect shifts in storytelling styles, audience attention spans, and production trends. Analyzing this data can provide insights into evolving cinematic practices and audience expectations regarding movie length.

Data Wrangling¶

In [1]:
import pandas as pd

def wrangling(file_path):
    # Load the dataset
    df = pd.read_csv(file_path)

    # Display the first few rows of the dataset
    print(df.head())

    # Display the info of the dataset to understand its structure
    print(df.info())

    # Display summary statistics of the dataset
    print(df.describe())

    # Check for missing values
    print(df.isnull().sum())

    return df

# Call the function with the file path
df = wrangling('Database_TMDb_movie_data/tmdb-movies.csv')
       id    imdb_id  popularity     budget     revenue  \
0  135397  tt0369610   32.985763  150000000  1513528810   
1   76341  tt1392190   28.419936  150000000   378436354   
2  262500  tt2908446   13.112507  110000000   295238201   
3  140607  tt2488496   11.173104  200000000  2068178225   
4  168259  tt2820852    9.335014  190000000  1506249360   

                 original_title  \
0                Jurassic World   
1            Mad Max: Fury Road   
2                     Insurgent   
3  Star Wars: The Force Awakens   
4                     Furious 7   

                                                cast  \
0  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...   
1  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...   
2  Shailene Woodley|Theo James|Kate Winslet|Ansel...   
3  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...   
4  Vin Diesel|Paul Walker|Jason Statham|Michelle ...   

                                            homepage          director  \
0                      http://www.jurassicworld.com/   Colin Trevorrow   
1                        http://www.madmaxmovie.com/     George Miller   
2     http://www.thedivergentseries.movie/#insurgent  Robert Schwentke   
3  http://www.starwars.com/films/star-wars-episod...       J.J. Abrams   
4                           http://www.furious7.com/         James Wan   

                         tagline  ...  \
0              The park is open.  ...   
1             What a Lovely Day.  ...   
2     One Choice Can Destroy You  ...   
3  Every generation has a story.  ...   
4            Vengeance Hits Home  ...   

                                            overview runtime  \
0  Twenty-two years after the events of Jurassic ...     124   
1  An apocalyptic story set in the furthest reach...     120   
2  Beatrice Prior must confront her inner demons ...     119   
3  Thirty years after defeating the Galactic Empi...     136   
4  Deckard Shaw seeks revenge against Dominic Tor...     137   

                                      genres  \
0  Action|Adventure|Science Fiction|Thriller   
1  Action|Adventure|Science Fiction|Thriller   
2         Adventure|Science Fiction|Thriller   
3   Action|Adventure|Science Fiction|Fantasy   
4                      Action|Crime|Thriller   

                                production_companies release_date vote_count  \
0  Universal Studios|Amblin Entertainment|Legenda...       6/9/15       5562   
1  Village Roadshow Pictures|Kennedy Miller Produ...      5/13/15       6185   
2  Summit Entertainment|Mandeville Films|Red Wago...      3/18/15       2480   
3          Lucasfilm|Truenorth Productions|Bad Robot     12/15/15       5292   
4  Universal Pictures|Original Film|Media Rights ...       4/1/15       2947   

   vote_average  release_year    budget_adj   revenue_adj  
0           6.5          2015  1.379999e+08  1.392446e+09  
1           7.1          2015  1.379999e+08  3.481613e+08  
2           6.3          2015  1.012000e+08  2.716190e+08  
3           7.5          2015  1.839999e+08  1.902723e+09  
4           7.3          2015  1.747999e+08  1.385749e+09  

[5 rows x 21 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
None
                  id    popularity        budget       revenue       runtime  \
count   10866.000000  10866.000000  1.086600e+04  1.086600e+04  10866.000000   
mean    66064.177434      0.646441  1.462570e+07  3.982332e+07    102.070863   
std     92130.136561      1.000185  3.091321e+07  1.170035e+08     31.381405   
min         5.000000      0.000065  0.000000e+00  0.000000e+00      0.000000   
25%     10596.250000      0.207583  0.000000e+00  0.000000e+00     90.000000   
50%     20669.000000      0.383856  0.000000e+00  0.000000e+00     99.000000   
75%     75610.000000      0.713817  1.500000e+07  2.400000e+07    111.000000   
max    417859.000000     32.985763  4.250000e+08  2.781506e+09    900.000000   

         vote_count  vote_average  release_year    budget_adj   revenue_adj  
count  10866.000000  10866.000000  10866.000000  1.086600e+04  1.086600e+04  
mean     217.389748      5.974922   2001.322658  1.755104e+07  5.136436e+07  
std      575.619058      0.935142     12.812941  3.430616e+07  1.446325e+08  
min       10.000000      1.500000   1960.000000  0.000000e+00  0.000000e+00  
25%       17.000000      5.400000   1995.000000  0.000000e+00  0.000000e+00  
50%       38.000000      6.000000   2006.000000  0.000000e+00  0.000000e+00  
75%      145.750000      6.600000   2011.000000  2.085325e+07  3.369710e+07  
max     9767.000000      9.200000   2015.000000  4.250000e+08  2.827124e+09  
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64

Data Cleaning¶

Let's start cleaning the dataset.
Having collected some information about it, we can now make the dataset cleaner and easier to work with.

In [10]:
def clean_data(df):
    # Work on a copy so the original DataFrame is left unchanged
    df = df.copy()

    # Fill missing values in descriptive text columns
    df['homepage'] = df['homepage'].fillna('Not available')
    df['tagline'] = df['tagline'].fillna('')
    df['keywords'] = df['keywords'].fillna('')
    df['overview'] = df['overview'].fillna('')

    # Drop rows with missing values in critical columns
    df = df.dropna(subset=['genres', 'production_companies', 'director', 'cast', 'imdb_id'])

    # Reset the index so it is contiguous after dropping rows
    df = df.reset_index(drop=True)

    return df

# Clean the data
df_cleaned = clean_data(df)
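
The summary statistics above show a minimum of 0 for budget, revenue, and runtime, and a median of 0 for budget and revenue. The sketch below shows one possible way to handle this, under the assumption (not confirmed by the dataset documentation) that a zero in these columns means "not recorded" rather than a true value. The function name `mark_zero_as_missing` and the small sample frame are hypothetical and only serve as an illustration.

```python
import pandas as pd


def mark_zero_as_missing(df, columns=("budget", "revenue", "runtime")):
    """Return a copy of df where zeros in the given columns become NaN.

    Assumption: a zero in these columns means the value was not recorded.
    """
    out = df.copy()
    for col in columns:
        # mask() replaces the entries where the condition is True with NaN
        out[col] = out[col].mask(out[col] == 0)
    return out


# Tiny illustrative frame (values loosely modeled on the head() output above)
sample = pd.DataFrame({
    "budget": [150000000, 0, 110000000],
    "revenue": [1513528810, 0, 0],
    "runtime": [124, 0, 119],
})

marked = mark_zero_as_missing(sample)
print(marked.isna().sum())
```

With zeros marked as missing, averages computed later skip those rows by default instead of being pulled toward zero.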

Genre Distribution

The bar chart below shows the distribution of movies by genre. We can observe which genres are most prevalent in the dataset, providing an understanding of the genre composition of the movies we are analyzing.
Note that the chart is hard to read because the dataset contains a very large number of distinct genre combinations; the rest of the report aims to make these observations clearer.

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns


# Count the occurrences of each genre
genre_counts = df_cleaned['genres'].value_counts().reset_index()
genre_counts.columns = ['genres', 'count']

# Plot the distribution of movie genres
plt.figure(figsize=(14, 8))
sns.barplot(data=genre_counts, x='genres', y='count', palette='viridis')

plt.xlabel('Genres')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Genres')

# Rotate x-axis labels
plt.xticks(rotation=45, ha='right')

plt.tight_layout(pad=3.0)
plt.show()
[Figure: bar chart, "Distribution of Movie Genres"]
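
The `genres` column stores several genres per movie separated by `|` (for example `Action|Crime|Thriller`), so counting its raw values counts genre combinations rather than individual genres. A minimal sketch of one way to count individual genres is shown below. The function name `split_genres`, the frame name `df_split`, and the column name `genre` are assumptions chosen for illustration, and the three sample rows are taken from the `head()` output above.

```python
import pandas as pd


def split_genres(df, column="genres", new_column="genre"):
    """Return a frame with one row per (movie, genre) pair."""
    out = df.dropna(subset=[column]).copy()
    # Turn "Action|Crime|Thriller" into ["Action", "Crime", "Thriller"]
    out[new_column] = out[column].str.split("|")
    # explode() repeats each movie row once per genre in its list
    return out.explode(new_column).reset_index(drop=True)


sample = pd.DataFrame({
    "original_title": ["Jurassic World", "Insurgent", "Furious 7"],
    "genres": [
        "Action|Adventure|Science Fiction|Thriller",
        "Adventure|Science Fiction|Thriller",
        "Action|Crime|Thriller",
    ],
})

df_split = split_genres(sample)
genre_counts = df_split["genre"].value_counts()
print(genre_counts)
```

Applied to the cleaned data, the resulting counts would give one bar per genre, which should be considerably easier to read than one bar per combination.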

Popularity Distribution

The histogram below shows the distribution of popularity scores across all movies in the dataset. This gives us an idea of how popularity is spread among the movies, highlighting any skewness or outliers in the data.

In [4]:
plt.figure(figsize=(12, 6))
sns.histplot(df['popularity'], bins=30, kde=True)
plt.xlabel('Popularity')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Popularity')
plt.tight_layout()
plt.show()
[Figure: histogram, "Distribution of Movie Popularity"]

The genre distribution bar chart shown earlier indicates that genres such as Drama, Thriller, Action, and Comedy are the most common in the dataset. This suggests that these genres are popular choices for movie production.

Exploratory Data Analysis¶

Having looked at the popularity and genre distributions, we can now analyze our questions:

Research Question 1¶

Is there a relationship between movie budgets and revenues across different genres?

In [5]:
# Group by genre and calculate the average revenue and budget
genre_stats = df_cleaned.groupby('genres')[['revenue', 'budget']].mean().reset_index()

# Plot the correlation
plt.figure(figsize=(14, 8))
sns.scatterplot(data=genre_stats, x='budget', y='revenue', hue='genres', palette='tab20', s=100)
plt.xlabel('Average Budget')
plt.ylabel('Average Revenue')
plt.title('Average Revenue vs. Average Budget by Genre')
# plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# plt.tight_layout()
plt.show()
[Figure: scatter plot, "Average Revenue vs. Average Budget by Genre"]

The plot above shows how the average budget and average revenue differ by genre.
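
To put a number on the budget and revenue relationship within each genre, one option is a per-genre Pearson correlation. The sketch below is an illustration only: it assumes a frame with one row per (movie, genre) pair and a single-genre `genre` column, it drops rows where budget or revenue is zero (assuming zero means "not recorded"), and it skips genres with fewer than `min_movies` rows. The function name `budget_revenue_corr` and all sample numbers are made up.

```python
import pandas as pd


def budget_revenue_corr(df_split, min_movies=3):
    """Pearson correlation between budget and revenue for each genre."""
    # Keep only rows where both figures are present (assumed: 0 = not recorded)
    valid = df_split[(df_split["budget"] > 0) & (df_split["revenue"] > 0)]
    rows = []
    for genre, group in valid.groupby("genre"):
        if len(group) >= min_movies:
            rows.append({
                "genre": genre,
                "n_movies": len(group),
                "corr": group["budget"].corr(group["revenue"]),
            })
    return pd.DataFrame(rows, columns=["genre", "n_movies", "corr"])


# Made-up numbers, for illustration only
sample = pd.DataFrame({
    "genre": ["Adventure"] * 4 + ["Horror"] * 3 + ["Western"] * 2,
    "budget": [100, 200, 300, 0, 10, 20, 30, 50, 60],
    "revenue": [250, 500, 750, 900, 90, 60, 30, 40, 80],
})

corr_table = budget_revenue_corr(sample)
print(corr_table)
```

A correlation near 1 for a genre would indicate that larger budgets go together with larger revenues in that genre; it does not by itself show that spending more causes higher revenue.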

In [6]:
import matplotlib.pyplot as plt
import seaborn as sns


# The helpers below expect df_split: a DataFrame with one row per
# (movie, genre) pair and a single-genre 'genre' column.

def calculate_genre_stats(df_split):
    # Average popularity per genre and release year
    return df_split.groupby(['genre', 'release_year'])['popularity'].mean().reset_index()


def calculate_genre_stats3(df_split):
    # Average revenue and budget per genre
    return df_split.groupby('genre').agg({'revenue': 'mean', 'budget': 'mean'}).reset_index()


def plot_genre_popularity(genre_popularity):
    # Line plot of average popularity per genre over the years
    plt.figure(figsize=(16, 8))
    sns.lineplot(data=genre_popularity, x='release_year', y='popularity', hue='genre', palette='tab20', linewidth=2.5)
    plt.xlabel('Release Year')
    plt.ylabel('Average Popularity')
    plt.title('Average Popularity of Different Genres Over the Years')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

Research Question 2¶


How has the popularity of different genres changed over the years?

In [7]:
import matplotlib.pyplot as plt
import seaborn as sns

# Group by genre and release year, and calculate the average popularity
genre_popularity = df_cleaned.groupby(['genres', 'release_year'])['popularity'].mean().reset_index()

# Plotting the average popularity of each genre over the years
plt.figure(figsize=(14, 8))
sns.lineplot(data=genre_popularity, x='release_year', y='popularity', hue='genres', palette='tab20', marker='o')

plt.xlabel('Release Year')
plt.ylabel('Average Popularity')
plt.title('Average Popularity of Each Genre Over the Years')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Genres')
plt.tight_layout(pad=3.0)
plt.show()
/tmp/ipykernel_13/3067001640.py:15: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations.
  plt.tight_layout(pad=3.0)
[Figure: line plot, "Average Popularity of Each Genre Over the Years"]

This shows how the popularity of different genres has changed over the years.

The line plot of average popularity over the years reveals trends in how different genres have gained or lost popularity. For instance, genres such as Action and Adventure have maintained high popularity, while others may have seen fluctuations.
These trends are consistent with the previous plot, which supports the results.
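
With many genres on one line plot, individual trends are hard to follow. One possible way to reduce the clutter is to keep only the most frequent genres and arrange the yearly averages as a year-by-genre table. The sketch below assumes a frame with one row per (movie, genre) pair and a single-genre `genre` column; the function name `popularity_by_top_genres` and the sample values are hypothetical.

```python
import pandas as pd


def popularity_by_top_genres(df_split, top_n=5):
    """Average popularity per year for the top_n most frequent genres."""
    top = df_split["genre"].value_counts().head(top_n).index
    subset = df_split[df_split["genre"].isin(top)]
    yearly = subset.groupby(["genre", "release_year"])["popularity"].mean()
    # Rows become release years, columns become genres
    return yearly.unstack("genre")


# Made-up numbers, for illustration only
sample = pd.DataFrame({
    "genre": ["Drama", "Drama", "Drama", "Action", "Action", "Western"],
    "release_year": [2014, 2015, 2015, 2014, 2015, 2014],
    "popularity": [1.0, 2.0, 4.0, 3.0, 5.0, 9.0],
})

popularity_table = popularity_by_top_genres(sample, top_n=2)
print(popularity_table)
```

The resulting table can be passed to `popularity_table.plot()` to draw one line per genre with a much shorter legend.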

Research Question 3¶


How has the average runtime of movies changed over the years?

In [8]:
# Calculate the average runtime for each release year
avg_runtime = df_cleaned.groupby('release_year')['runtime'].mean()

# Plot the average runtime
plt.figure(figsize=(12, 6))
plt.bar(avg_runtime.index, avg_runtime.values, color='skyblue')
plt.xlabel('Release Year')
plt.ylabel('Average Runtime')
plt.title('Average Runtime Over Years')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
[Figure: bar chart, "Average Runtime Over Years"]

The plot above shows the average runtime for each release year, with the earlier years tending toward longer runtimes.
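
A bar chart shows the shape of the trend but not its size. A rough way to quantify it is to fit a straight line to the yearly averages and read off the slope in minutes per year. The sketch below is an illustration under two assumptions: runtimes of 0 (the minimum reported in the summary statistics) are treated as unrecorded and dropped, and a linear fit is an adequate first approximation. The function name `yearly_runtime_trend` and the sample values are made up.

```python
import numpy as np
import pandas as pd


def yearly_runtime_trend(df):
    """Return (yearly mean runtime, slope in minutes per year)."""
    # Assumption: a runtime of 0 means the value was not recorded
    valid = df[df["runtime"] > 0]
    yearly = valid.groupby("release_year")["runtime"].mean()
    years = yearly.index.to_numpy(dtype=float)
    # Center the years so the least-squares fit is numerically well behaved
    centered = years - years.mean()
    slope, _ = np.polyfit(centered, yearly.to_numpy(dtype=float), 1)
    return yearly, slope


# Made-up numbers, for illustration only
sample = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2002],
    "runtime": [120, 0, 110, 110, 100],
})

yearly, slope = yearly_runtime_trend(sample)
print(yearly)
print(f"Slope: {slope:.2f} minutes per year")
```

A negative slope would correspond to runtimes getting shorter over time, and a positive slope to runtimes getting longer.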


Conclusions¶

In this analysis of the TMDb movie dataset, we explored the relationship between movie genres and various aspects such as popularity, revenue, and budget. Here are the key findings:

Genres and Popularity

We observed that certain genres, such as Action, Adventure, and Science Fiction, tended to be more popular among audiences over the years. These genres consistently showed high average popularity scores.

Common Genres

The most common genres in the dataset were Drama, Comedy, Thriller, and Action. These genres were prevalent across a wide range of movies in the dataset.

Genres and Revenue/Budget

While there was some variation, we found that genres such as Adventure, Fantasy, and Animation tended to have higher average revenues and budgets compared to other genres. This suggests a potential correlation between these genres and financial success.

Overall Insights

Overall, our analysis suggests that certain genres have a significant impact on the success of a movie, both in terms of audience engagement and financial performance. Understanding these trends can help filmmakers and studios make informed decisions about the types of movies to produce.

Limitations

It's important to note that our analysis has some limitations. The dataset may not be fully representative of all movies released, and there may be other factors beyond genre that influence a movie's success.

Relationship Between Movie Budgets and Revenues Across Different Genres

The analysis revealed varying relationships between budgets and revenues across different genres. While some genres, such as Adventure and Animation, tend to have higher average revenues, they also require larger budgets. On the other hand, genres like Documentary and Horror show lower budget requirements but can still achieve significant revenues. This suggests that budgeting strategies should be tailored to the specific genre to maximize financial success.

Popularity Trends of Different Genres Over the Years

The popularity trends of genres have fluctuated over the years, indicating shifting audience preferences. Certain genres, such as Adventure and Science Fiction, have shown consistent popularity, while others, like Western and War, have experienced declines. These trends highlight the importance of adapting to changing audience tastes and producing content that aligns with current trends to maintain audience engagement.

Changes in Average Runtime Over the Years

The analysis of average movie runtimes over the years revealed a gradual decline in recent decades. This trend suggests a shift towards shorter movie lengths, possibly driven by changes in audience preferences and viewing habits. Filmmakers may need to consider these trends when planning the length of their movies to cater to modern audience expectations.

Future Work

Future work could explore additional factors that contribute to a movie's success, such as the impact of specific actors or directors, as well as regional variations in genre preferences.

In [9]:
# Running this cell will execute a bash command to convert this notebook to an .html file
!python -m nbconvert --to html Investigate_a_Dataset.ipynb
[NbConvertApp] Converting notebook Investigate_a_Dataset.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 4 image(s).
[NbConvertApp] Writing 10220760 bytes to Investigate_a_Dataset.html